draft for enable prefill in cudagraph #3354

littledgg · 2025-08-12T09:50:09Z

No description provided.

paddle-bot · 2025-08-12T09:50:13Z

Thanks for your contribution!

gongshaotian · 2025-08-12T11:12:45Z

fastdeploy/worker/gpu_model_runner.py

+        if self.cudagraph_capture_prefill:
+            self.capture_model_prefill()
+


Please add this in gpu_worker instead; it should sit at the same level as the original gpu_model_runner call.

gongshaotian · 2025-08-12T11:14:44Z

fastdeploy/worker/gpu_model_runner.py

@@ -1007,6 +1087,165 @@ def initialize_attn_backend(self) -> None:

        self.attn_backends.append(attn_backend)

+    def _dummy_run_prefill(


Does the entire dummy run need to be rewritten? Wouldn't rewriting just _dummy_prefill_inputs_prefill be enough?

gongshaotian · 2025-08-12T11:16:47Z

fastdeploy/worker/gpu_model_runner.py

+                decode_exists = self.exist_decode()
+                paddle.distributed.all_gather_object(only_prefill_batch_list, not decode_exists)
+                only_prefill_batch = all(only_prefill_batch_list)
+                self.fd_config.parallel_config.moe_phase.phase = "prefill" if only_prefill_batch else "decode"


Why does moe_phase.phase need to be changed?
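For context on the hunk above, here is a minimal sketch of the decision it makes, with the result of `paddle.distributed.all_gather_object` simulated by a plain Python list so it runs without Paddle. The helper name `pick_moe_phase` and the per-rank input list are hypothetical and do not come from the PR.

```python
def pick_moe_phase(decode_exists_per_rank):
    """Return the MoE phase given one decode_exists flag per rank.

    Hypothetical helper: the list stands in for what all_gather_object
    would collect from every rank in the hunk above.
    """
    # Each rank contributes `not decode_exists`, as in the diff.
    only_prefill_batch_list = [not exists for exists in decode_exists_per_rank]
    # The phase is "prefill" only if no rank has any decode request.
    only_prefill_batch = all(only_prefill_batch_list)
    return "prefill" if only_prefill_batch else "decode"


all_prefill = pick_moe_phase([False, False, False])
one_rank_decoding = pick_moe_phase([False, True, False])
```

Under this reading, a single rank with a decode request is enough to switch every rank to "decode".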

gongshaotian · 2025-08-12T11:20:59Z

fastdeploy/worker/gpu_model_runner.py

+        full_length = min(
+            num_tokens // batch_size,
+            self.parallel_config.max_model_len - max_dec_len,
+        )
+
+        # NOTE(wanglongzhi): When the full length is too large, DeepEP's buffer size will not be enough to cause the result to appear nan.
+        # TODO(wanglongzhi): Figure out the accurate buffer size of DeepEP.
+        if self.fd_config.parallel_config.enable_expert_parallel:
+            full_length = min(full_length, 32)
+
+        input_length = int(full_length * self.cache_config.kv_cache_ratio)
+        block_num = (
+            input_length + self.cache_config.block_size - 1
+        ) // self.cache_config.block_size + self.cache_config.enc_dec_block_num


Does this logic guarantee that input_length equals the num_tokens we want to capture?
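To make the question concrete, the arithmetic in the hunk above can be replayed in isolation. This is a minimal sketch, not the PR's code: the function name `dummy_prefill_shape` and every numeric value below are hypothetical, chosen only to show when `input_length` diverges from `num_tokens`.

```python
def dummy_prefill_shape(num_tokens, batch_size, max_model_len, max_dec_len,
                        kv_cache_ratio, block_size, enc_dec_block_num,
                        enable_expert_parallel=False):
    """Replay the length and block computation from the diff (hypothetical wrapper)."""
    full_length = min(num_tokens // batch_size, max_model_len - max_dec_len)
    # The diff caps the length at 32 when expert parallelism is enabled.
    if enable_expert_parallel:
        full_length = min(full_length, 32)
    # int() truncates, so any ratio below 1.0 shrinks the input length.
    input_length = int(full_length * kv_cache_ratio)
    # Ceiling division by block_size, plus the extra encoder/decoder blocks.
    block_num = (input_length + block_size - 1) // block_size + enc_dec_block_num
    return input_length, block_num


# Illustrative values only; none of them are taken from the PR.
common = dict(num_tokens=512, batch_size=1, max_model_len=8192, max_dec_len=1,
              block_size=64, enc_dec_block_num=2)

ratio_075 = dummy_prefill_shape(kv_cache_ratio=0.75, **common)  # (384, 8)
ratio_100 = dummy_prefill_shape(kv_cache_ratio=1.0, **common)   # (512, 10)
with_ep = dummy_prefill_shape(kv_cache_ratio=0.75,
                              enable_expert_parallel=True, **common)  # (24, 3)
```

With these assumed inputs, `input_length` matches `num_tokens` only when `batch_size` is 1, `kv_cache_ratio` is 1.0, expert parallelism is off, and `max_model_len - max_dec_len` does not clip the length.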

gongshaotian · 2025-08-12T11:23:19Z

fastdeploy/worker/gpu_model_runner.py

@@ -909,6 +967,28 @@ def initialize_forward_meta(self):
            and not (prefill_exists if prefill_exists is not None else self.exist_prefill())


The pointer of the seq_lens_encoder tensor changes. Please also print the other tensors output by the get block shape kernel and check them.
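As a host-side analogy for why a changing pointer matters, the sketch below uses Python's standard `array` module rather than Paddle tensors. A captured graph replays against the addresses recorded at capture time, so a buffer must be updated in place rather than reallocated. The names `overwrite_in_place`, `seq_lens`, and `captured_addr` are hypothetical, and this does not reproduce the actual kernel behavior.

```python
from array import array


def overwrite_in_place(buf, values):
    """Write new values into the existing buffer without reallocating it."""
    for i, v in enumerate(values):
        buf[i] = v
    return buf


seq_lens = array("i", [4, 0, 0])
# buffer_info() returns (address, length); this is the address a capture would record.
captured_addr = seq_lens.buffer_info()[0]

# In-place update: the contents change, the address does not.
overwrite_in_place(seq_lens, [7, 2, 0])
same_buffer = seq_lens.buffer_info()[0] == captured_addr

# Rebinding to a freshly allocated array: the old buffer is kept alive in `old`,
# so the new one necessarily lives at a different address.
old = seq_lens
seq_lens = array("i", [1, 1, 1])
moved = seq_lens.buffer_info()[0] != captured_addr
```

Printing the address of each output tensor before and after a step, as the comment suggests, is the same kind of check applied to the real tensors.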

draft for enable prefill in cudagraph

253ab55

paddle-bot bot added the contributor External developers label Aug 12, 2025

gongshaotian reviewed Aug 12, 2025
